Skip to content

fetch: pass transport to post-fetch connectivity check#2123

Closed
spkrka wants to merge 1 commit into
gitgitgadget:masterfrom
spkrka:fetch-transport-fix
Closed

fetch: pass transport to post-fetch connectivity check#2123
spkrka wants to merge 1 commit into
gitgitgadget:masterfrom
spkrka:fetch-transport-fix

Conversation

@spkrka
Copy link
Copy Markdown

@spkrka spkrka commented May 24, 2026

We're working on reducing git fetch times on a large monorepo (2.4M
commits, 374K files, 10.9K local refs). Profiling showed the post-fetch
connectivity check (rev-list --objects --stdin --not --all) dominating
wall time when there are new objects.

While investigating, I noticed that check_connected() already has a
fast path for self-contained packs — it uses find_pack_entry_one() to
skip refs whose objects are in the new pack. builtin/clone.c passes
the transport to enable this, but store_updated_refs() in
builtin/fetch.c does not, making the optimization dead code for
fetches.

The fix is a three-line change to thread the transport through.

cc: Kristofer Karlsson krka@spotify.com
cc: Jeff King peff@peff.net

When fetching with a transport that sets `self_contained_and_connected`
(as index-pack does for self-contained packs), check_connected() can
use find_pack_entry_one() to skip connectivity verification for refs
whose objects exist in the new pack. This avoids sending those OIDs to
the rev-list child process.

However, store_updated_refs() never passed the transport to
check_connected(), so opt.transport was always NULL and this
optimization was dead code for post-fetch connectivity checks.

Thread the transport parameter through store_updated_refs() and set
opt.transport so that check_connected() can take advantage of
self-contained packs.

On a large repository (2.4M commits, 374K files, 10.9K local refs),
fetching 200 new commits:

  Before: rev-list connectivity check  22s,  total fetch  36s
  After:  rev-list connectivity check   5s,  total fetch  14s

The remaining 5s is spent verifying refs not contained in the new pack.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
@spkrka spkrka marked this pull request as ready for review May 24, 2026 12:27
@spkrka
Copy link
Copy Markdown
Author

spkrka commented May 24, 2026

/submit

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

Submitted as pull.2123.git.1779625693328.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2123/spkrka/fetch-transport-fix-v1

To fetch this version to local tag pr-2123/spkrka/fetch-transport-fix-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2123/spkrka/fetch-transport-fix-v1

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Kristofer Karlsson <krka@spotify.com>
>
> When fetching with a transport that sets `self_contained_and_connected`
> (as index-pack does for self-contained packs), check_connected() can
> use find_pack_entry_one() to skip connectivity verification for refs
> whose objects exist in the new pack. This avoids sending those OIDs to
> the rev-list child process.
>
> However, store_updated_refs() never passed the transport to
> check_connected(), so opt.transport was always NULL and this
> optimization was dead code for post-fetch connectivity checks.
>
> Thread the transport parameter through store_updated_refs() and set
> opt.transport so that check_connected() can take advantage of
> self-contained packs.
>
> On a large repository (2.4M commits, 374K files, 10.9K local refs),
> fetching 200 new commits:
>
>   Before: rev-list connectivity check  22s,  total fetch  36s
>   After:  rev-list connectivity check   5s,  total fetch  14s
>
> The remaining 5s is spent verifying refs not contained in the new pack.

Impressive.

The check_connected() function itself is a battle tested helper
function, with the optimization that originates in c6807a40 (clone:
open a shortcut for connectivity check, 2013-05-26), and then
polished in 26b974b3 (check_connected(): delay opening new_pack,
2026-03-05), allowing available "transport" to be taken into account
does make very good sense.

The other call to check_connected() that appear in builtin/fetch.c
does not pass opt.transport, either, but this one checks before we
even fetch any packs over any transport, so a tweak similar to this
patch would not help that code path, I guess.  In fact, many calls
to check_connected() elsewhere use opt that is often local to the
scope, that do not have transport at all.  I wonder if there are
some of them that benefit from a similar tweak?

Thanks.


>
> Signed-off-by: Kristofer Karlsson <krka@spotify.com>
> ---
>     fetch: pass transport to post-fetch connectivity check
>     
>     We're working on reducing git fetch times on a large monorepo (2.4M
>     commits, 374K files, 10.9K local refs). Profiling showed the post-fetch
>     connectivity check (rev-list --objects --stdin --not --all) dominating
>     wall time when there are new objects.
>     
>     While investigating, I noticed that check_connected() already has a fast
>     path for self-contained packs — it uses find_pack_entry_one() to skip
>     refs whose objects are in the new pack. builtin/clone.c passes the
>     transport to enable this, but store_updated_refs() in builtin/fetch.c
>     does not, making the optimization dead code for fetches.
>     
>     The fix is a three-line change to thread the transport through.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2123%2Fspkrka%2Ffetch-transport-fix-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2123/spkrka/fetch-transport-fix-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/2123
>
>  builtin/fetch.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index a22c319467..647fd1c30c 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1213,6 +1213,7 @@ N_("it took %.2f seconds to check forced updates; you can use\n"
>     "to avoid this check\n");
>  
>  static int store_updated_refs(struct display_state *display_state,
> +			      struct transport *transport,
>  			      int connectivity_checked,
>  			      struct ref_transaction *transaction, struct ref *ref_map,
>  			      struct fetch_head *fetch_head,
> @@ -1228,6 +1229,7 @@ static int store_updated_refs(struct display_state *display_state,
>  	if (!connectivity_checked) {
>  		struct check_connected_options opt = CHECK_CONNECTED_INIT;
>  
> +		opt.transport = transport;
>  		opt.exclude_hidden_refs_section = "fetch";
>  		rm = ref_map;
>  		if (check_connected(iterate_ref_map, &rm, &opt)) {
> @@ -1432,7 +1434,7 @@ static int fetch_and_consume_refs(struct display_state *display_state,
>  	}
>  
>  	trace2_region_enter("fetch", "consume_refs", the_repository);
> -	ret = store_updated_refs(display_state, connectivity_checked,
> +	ret = store_updated_refs(display_state, transport, connectivity_checked,
>  				 transaction, ref_map, fetch_head, config,
>  				 display_array);
>  	trace2_region_leave("fetch", "consume_refs", the_repository);
>
> base-commit: 6a4418c36d6bad69a599044b3cf49dcbd049cb45

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

Kristofer Karlsson wrote on the Git mailing list (how to reply to this email):

Good catch! After finding this case, I looked into the other related
call sites but found that they are already correct as-is:
- builtin/clone.c - already passes opt.transport (this is where I
copied it from)
- builtin/receive-pack.c (3 calls) - no transport object available to propagate
- fetch-pack.c - only used for the --deepen path, which sets
connectivity_checked when it passes,
  so the store_updated_refs() check is skipped entirely and transport
is not needed
- bundle.c - no need for transport

I am not 100% sure, but I suppose it's always possible to follow up
with more reuse of this later.

- Kristofer

On Sun, 24 May 2026 at 14:53, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > From: Kristofer Karlsson <krka@spotify.com>
> >
> > When fetching with a transport that sets `self_contained_and_connected`
> > (as index-pack does for self-contained packs), check_connected() can
> > use find_pack_entry_one() to skip connectivity verification for refs
> > whose objects exist in the new pack. This avoids sending those OIDs to
> > the rev-list child process.
> >
> > However, store_updated_refs() never passed the transport to
> > check_connected(), so opt.transport was always NULL and this
> > optimization was dead code for post-fetch connectivity checks.
> >
> > Thread the transport parameter through store_updated_refs() and set
> > opt.transport so that check_connected() can take advantage of
> > self-contained packs.
> >
> > On a large repository (2.4M commits, 374K files, 10.9K local refs),
> > fetching 200 new commits:
> >
> >   Before: rev-list connectivity check  22s,  total fetch  36s
> >   After:  rev-list connectivity check   5s,  total fetch  14s
> >
> > The remaining 5s is spent verifying refs not contained in the new pack.
>
> Impressive.
>
> The check_connected() function itself is a battle tested helper
> function, with the optimization that originates in c6807a40 (clone:
> open a shortcut for connectivity check, 2013-05-26), and then
> polished in 26b974b3 (check_connected(): delay opening new_pack,
> 2026-03-05), allowing available "transport" to be taken into account
> does make very good sense.
>
> The other call to check_connected() that appear in builtin/fetch.c
> does not pass opt.transport, either, but this one checks before we
> even fetch any packs over any transport, so a tweak similar to this
> patch would not help that code path, I guess.  In fact, many calls
> to check_connected() elsewhere use opt that is often local to the
> scope, that do not have transport at all.  I wonder if there are
> some of them that benefit from a similar tweak?
>
> Thanks.
>
>
> >
> > Signed-off-by: Kristofer Karlsson <krka@spotify.com>
> > ---
> >     fetch: pass transport to post-fetch connectivity check
> >
> >     We're working on reducing git fetch times on a large monorepo (2.4M
> >     commits, 374K files, 10.9K local refs). Profiling showed the post-fetch
> >     connectivity check (rev-list --objects --stdin --not --all) dominating
> >     wall time when there are new objects.
> >
> >     While investigating, I noticed that check_connected() already has a fast
> >     path for self-contained packs — it uses find_pack_entry_one() to skip
> >     refs whose objects are in the new pack. builtin/clone.c passes the
> >     transport to enable this, but store_updated_refs() in builtin/fetch.c
> >     does not, making the optimization dead code for fetches.
> >
> >     The fix is a three-line change to thread the transport through.
> >
> > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2123%2Fspkrka%2Ffetch-transport-fix-v1
> > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2123/spkrka/fetch-transport-fix-v1
> > Pull-Request: https://github.com/gitgitgadget/git/pull/2123
> >
> >  builtin/fetch.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/builtin/fetch.c b/builtin/fetch.c
> > index a22c319467..647fd1c30c 100644
> > --- a/builtin/fetch.c
> > +++ b/builtin/fetch.c
> > @@ -1213,6 +1213,7 @@ N_("it took %.2f seconds to check forced updates; you can use\n"
> >     "to avoid this check\n");
> >
> >  static int store_updated_refs(struct display_state *display_state,
> > +                           struct transport *transport,
> >                             int connectivity_checked,
> >                             struct ref_transaction *transaction, struct ref *ref_map,
> >                             struct fetch_head *fetch_head,
> > @@ -1228,6 +1229,7 @@ static int store_updated_refs(struct display_state *display_state,
> >       if (!connectivity_checked) {
> >               struct check_connected_options opt = CHECK_CONNECTED_INIT;
> >
> > +             opt.transport = transport;
> >               opt.exclude_hidden_refs_section = "fetch";
> >               rm = ref_map;
> >               if (check_connected(iterate_ref_map, &rm, &opt)) {
> > @@ -1432,7 +1434,7 @@ static int fetch_and_consume_refs(struct display_state *display_state,
> >       }
> >
> >       trace2_region_enter("fetch", "consume_refs", the_repository);
> > -     ret = store_updated_refs(display_state, connectivity_checked,
> > +     ret = store_updated_refs(display_state, transport, connectivity_checked,
> >                                transaction, ref_map, fetch_head, config,
> >                                display_array);
> >       trace2_region_leave("fetch", "consume_refs", the_repository);
> >
> > base-commit: 6a4418c36d6bad69a599044b3cf49dcbd049cb45

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

User Kristofer Karlsson <krka@spotify.com> has been added to the cc: list.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 25, 2026

This patch series was integrated into seen via git@d4cab0c.

@gitgitgadget gitgitgadget Bot added the seen label May 25, 2026
@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 25, 2026

This patch series was integrated into seen via git@a96d949.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 26, 2026

This branch is now known as kk/fetch-store-ref-optimization.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 26, 2026

This patch series was integrated into seen via git@5374b03.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 26, 2026

There was a status update in the "New Topics" section about the branch kk/fetch-store-ref-optimization on the Git mailing list:

When fetching from a transport that provides a self-contained pack,
pass the transport pointer to the post-fetch `check_connected()` call
to optimize connectivity check.

Will merge to 'next'?
source: <pull.2123.git.1779625693328.gitgitgadget@gmail.com>

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

This patch series was integrated into seen via git@434c7f0.

@spkrka spkrka closed this May 27, 2026
@spkrka spkrka deleted the fetch-transport-fix branch May 27, 2026 10:05
@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

Jeff King wrote on the Git mailing list (how to reply to this email):

On Sun, May 24, 2026 at 12:28:12PM +0000, Kristofer Karlsson via GitGitGadget wrote:

> From: Kristofer Karlsson <krka@spotify.com>
> 
> When fetching with a transport that sets `self_contained_and_connected`
> (as index-pack does for self-contained packs), check_connected() can
> use find_pack_entry_one() to skip connectivity verification for refs
> whose objects exist in the new pack. This avoids sending those OIDs to
> the rev-list child process.
> 
> However, store_updated_refs() never passed the transport to
> check_connected(), so opt.transport was always NULL and this
> optimization was dead code for post-fetch connectivity checks.
> 
> Thread the transport parameter through store_updated_refs() and set
> opt.transport so that check_connected() can take advantage of
> self-contained packs.

That makes sense in principle, but one thing puzzles me. We only turn on
the optimization in check_connected() if the transport's smart_options
has the self_contained_and_connected bit set. And we set that only when
we were told via check_self_contained_and_connected to do so (and we
pass the appropriate option to index-pack, which tells us the result is
OK).

And the only place that turns on check_self_contained_and_connected is
in builtin/clone.c. So how does this optimization work for a non-clone
fetch? Am I missing some code path?

-Peff

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

User Jeff King <peff@peff.net> has been added to the cc: list.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

Kristofer Karlsson wrote on the Git mailing list (how to reply to this email):

You're right. I dug into this further and realized the problem is deeper
than just the flag not being set in builtin/fetch.c.

Even if we add:
transport->smart_options->check_self_contained_and_connected = 1;
to prepare_transport(), the optimization still won't work for fetches.

The optimization is fundamentally clone-only.

I was unable to reproduce the benchmark numbers from my original commit
message. The patch as submitted is indeed inert for non-clone fetches.
It looked like a simple improvement, but it's clear that it was incorrect.
I'll drop it, and I apologize for the noise here.

-- Kristofer

On Wed, 27 May 2026 at 10:32, Jeff King <peff@peff.net> wrote:
>
> On Sun, May 24, 2026 at 12:28:12PM +0000, Kristofer Karlsson via GitGitGadget wrote:
>
> > From: Kristofer Karlsson <krka@spotify.com>
> >
> > When fetching with a transport that sets `self_contained_and_connected`
> > (as index-pack does for self-contained packs), check_connected() can
> > use find_pack_entry_one() to skip connectivity verification for refs
> > whose objects exist in the new pack. This avoids sending those OIDs to
> > the rev-list child process.
> >
> > However, store_updated_refs() never passed the transport to
> > check_connected(), so opt.transport was always NULL and this
> > optimization was dead code for post-fetch connectivity checks.
> >
> > Thread the transport parameter through store_updated_refs() and set
> > opt.transport so that check_connected() can take advantage of
> > self-contained packs.
>
> That makes sense in principle, but one thing puzzles me. We only turn on
> the optimization in check_connected() if the transport's smart_options
> has the self_contained_and_connected bit set. And we set that only when
> we were told via check_self_contained_and_connected to do so (and we
> pass the appropriate option to index-pack, which tells us the result is
> OK).
>
> And the only place that turns on check_self_contained_and_connected is
> in builtin/clone.c. So how does this optimization work for a non-clone
> fetch? Am I missing some code path?
>
> -Peff

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

Jeff King wrote on the Git mailing list (how to reply to this email):

On Wed, May 27, 2026 at 12:04:19PM +0200, Kristofer Karlsson wrote:

> You're right. I dug into this further and realized the problem is deeper
> than just the flag not being set in builtin/fetch.c.
> 
> Even if we add:
> transport->smart_options->check_self_contained_and_connected = 1;
> to prepare_transport(), the optimization still won't work for fetches.
> 
> The optimization is fundamentally clone-only.

I have wondered if the transport could do the same thing for:

  git init
  git fetch ...

When we do not send any "want", then we'd expect the pack we receive to
be self-contained.

But in practice that is not that exciting, as it is a special case that
does not come up that often.

I suspect there's some hybrid mode where we could save some work. It is
easy for index-pack to come up with a list of "edges" from the pack it
got that point outside of the pack. We just need to know that those
edges are reachable from existing refs. So really we could be checking
the connectivity of those edges, rather than the actual ref tips.

Would that be less work? I'm not sure. It saves walking over the
newly-fetched history, but in practice that is probably not that
expensive. It potentially saves a lot when the edge is a ref tip; for a
true fast-forward we'd see a ref going from A..B, and if index-pack
tells us that it just needs A, we can skip the traversal entirely.

But index-pack isn't really thinking in terms of commits, but rather the
whole object graph. So you're going to find that commit A is needed, but
also all of the tree entries in the existing history that weren't
touched by the new history (e.g., B touched path "foo" but not "bar", so
it gets a new top-level tree, a new blob for "foo", but still references
the existing blob for "bar"). I guess you could speculatively load A^{tree}
to cull the list.

So I dunno. I think there is some room for speedup here, and in many
common cases you could skip the rev-list invocation entirely. But it's
not trivial, and I think is far afield from what your patch was
originally trying to do. ;)

> I was unable to reproduce the benchmark numbers from my original commit
> message.

Yeah, I wondered where the numbers came from. It is very easy to fool
yourself with fetch benchmarks, because even "fetch --dry-run" will
transfer objects, and under the hood we try to optimize out as much
object transfer as possible. So you really have to start from the exact
same on-disk state for each trial.

> The patch as submitted is indeed inert for non-clone fetches.
> It looked like a simple improvement, but it's clear that it was incorrect.
> I'll drop it, and I apologize for the noise here.

No problem. You've been generating some interesting optimization work
lately, so I can't complain. :)

I'll probably have more comments on your other topics, but I'm out of
time for tonight.

-Peff

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

There was a status update in the "Cooking" section about the branch kk/fetch-store-ref-optimization on the Git mailing list:

When fetching from a transport that provides a self-contained pack,
pass the transport pointer to the post-fetch `check_connected()` call
to optimize connectivity check.

Will merge to 'next'?
source: <pull.2123.git.1779625693328.gitgitgadget@gmail.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant